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Abstract 

We consider the problem of learning the structure of a pairwise graphical model over 
continuous and discrete variables. We present a new pairwise model for graphical models 
with both continuous and discrete variables. The structure and parameters of this model 
are learned using the pseudo-likelihood approximation with group-sparsity regularization. 
The pairwise model is also extended to incorporate features. Two algorithms for solving 
the resulting optimization problem are presented. The proposed models are compared with 
competing methods on synthetic data and a survey dataset. 

Keywords: mixed graphical models, Markov random field, l\ regularization, structure 
learning 



1. Introduction 



Many authors have considered the problem of learning the edge structure and parameters 
of sparse undirected graphical models. We will focus on using the l\ regularizer to promote 
sparsity. This line of work has taken two separate paths: one for learning continuous valued 
data and one for learning discrete valued data. In this work, we consider learning mixed 
models with both continuous variables and discrete variables. 

For continuous models, previous work assumes that the model is a multivariate Gaussian 
(Gaussian graphical model) with mean and inverse covariance G. is then estimated 
via the graphical lasso by minimizing the regularized negative log- likelihood ^(0) + A ||0||i- 



Several efficient methods for solving this can be found in (Friedman et al. |2008a| Banerjee 



et al. 2008). Because the graphical lasso problem is computationally challenging, several 



authors considered methods related to the pseudolikelihood (PL) and node-wise l\ least 
squares ( Meinshausen and Biihlmann 2006 Friedman et al. , 12010 Peng et al. 2009). For 



discrete models, previous work focuses on estimating a pairwise Markov random field of 
the form p(y) oc exp Y2 r <j 4>rj(yr, Vj)- The maximum likelihood problem is intractable 
for models with a moderate to large number of variables (high-dimensional) because its 
requires evaluating the partition function and its derivatives. Previous work considers the 



pseudolikelihood approach and the maximum likelihood approach (Schmidt, 2010 Schmidt 
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Our main contribution is to propose a model that connects the discrete and continuous 
models. Our model in the case of discrete variables, is a pairwise Markov random field and 
in the case of continuous variables is a Gaussian graphical model. In Section 2, we introduce 
a new mixed graphical model and previous approaches to modeling mixed data. Section 
3 discusses the likelihood and the pseudolikelihood approaches to parameter estimation. 
Section 4 discusses how to use the l\ penalty for edge selection, Section 5 presents the 
pseudolikelihood for the mixed model and its relationship to generalized linear models and 
Section 6 presents empirical results on synthetic and real datasets. 



2. Mixed Graphical Model 

We propose a pairwise graphical model on continuous and discrete variables. The continuous 
variables follow a multivariate Gaussian and the discrete variables follow a pairwise Markov 
random field model 



(v v j v v q q <? 

^2 $Z ~ 2^tx s x t + ^2 a s x s + ^2^2 P s J^yj) Xs + ^2 ^2 
s=l t=l s=l s=l j=l j=l r=l 



<t>rj(yr,Vj) 



(1) 



for a graphical model with p continuous variables and q discrete variables. The joint model 
is parametrized by = [{/3 s t}, {a s }, {p s j}, {</Vj}Q The discrete y r takes on L r states. 
The model parameters are (3 s t continuous-continuous edge potential, a s continuous node 
potential, p s j(yj) continuous-discrete edge potential, and 4> r j{yr,yj) discrete-discrete edge 
potential. In the case of continuous variables only, we recover the familiar multivariate 
Gaussian parametrized by the symmetric positive-definite inverse covariance matrix B = 
{f3 s t}, a symmetric positive definite matrix. In the case of discrete variables only, we recover 
the pairwise Markov random field parametrized by edge potentials <ft r j and node potentials 
<j) rr . 

The conditional distribution of the continuous variables given the discrete follow a mul- 
tivariate gaussian distribution, p(x\y) = Af(fJ>(y), B' 1 ). Each of these gaussian distributions 
share the same inverse covariance matrix B but differ in the mean parameter. By standard 
multivariate gaussian calculations, 

p(x\y)=N(B- 1 1 (y),B- 1 ) (2) 

{i(y)}s = ^2p S j(yj) (3) 

( q I i \ 

p(y) oc exp I (f>rj(y r , yj) + ^-y(y) T B ( 4 ) 

\j=l r=l - I 



1. psj(yj) is a function taking Lj values p s j(l), ■ ■ ■ , p s j(Lj). Similarly, 0rj(yr,J/j) is a bivariate function 
taking on L r x Lj values. Later, we will think of PsjiVj) as a vector of length Lj and 4>rj{yr,yj) as a 
matrix of size L r x Lj. 
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Thus we see that the continuous variables conditioned on the discrete are multivariate 
gaussian with common covariance, but with means that depend on the value of the dis- 
crete variables. The means depend additively on the values of the discrete variables since 
{l(y)}s = ^2j=\ PsjilJj)- The marginal p(y) has a known form, so for models with few num- 
ber of discrete variables we can sample efficiently. The single conditionals p(x s \x\ s ,y) is a 
gaussian linear regression and p(y r \x,y\ r ) is a (multiclass) logistic regression. We discuss 
this further in Section [3.11 



2.1 Previous work on mixed graphical models 



In Lauritzen (1996), a type of mixed graphical model is proposed. This model has the 
property that conditioned on discrete variables, p{x\y) = Af(p,(y), The homogeneous 

mixed graphical model enforces common covariance, = S. Thus our proposed model 
is a special case of Lauritzen's mixed model with the following assumptions: common co- 
variance, additive mean assumptions and the marginal p(y) factorizes as a pairwise discrete 
Markov random field. With these two assumptions, the full model simplifies to the pairwise 
model presented. Although the full model is more expressive, the number of parameters 
scales exponentially with the number of discrete variables. For each state of the discrete 
variables there is a mean and covariance. Consider an example with q binary variables and 
p continuous variables; the full model requires estimates of 2 q mean vectors and covariance 
matrices in p dimensions. Even if the homogeneous constraint is imposed on Lauritzen's 
model, there are still 2 q mean vectors for the case of binary discrete variables. In compari- 
son, the pairwise model has number of parameters 0((p + q) 2 ). 

3. Parameter Estimation: Maximum Likelihood and Pseudolikelihood 

Given samples (xi, yi)f = i, we want to find the maximum likelihood estimate of 0. This can 
be done by minimizing the negative log-likelihood of the samples: 

n 

= l °gp( x i, Vu ©) where (5) 

p p p p q 

logp(x,y;@) = ^2^2--f3 st x s x t + ^2a s x s + Psjiv. 



i=l 

/' P , /' /' 7 

K p S j{yj)x s 

8=1 t=l ~ 8=1 S=l j = l 

1 j 

+ ^5>rj(Vr,i&)-logZ(e) (6) 
j=l r=l 

The negative log-likelihood is convex, so standard gradient-descent algorithms can be used 
for computing the maximum likelihood estimates. The difficulty is that in each gradient 
step we have to compute T(x,y) — E p (Q\ [T(x,y)], the difference between the empirical 
sufficient statistic T(x, y) and the expected sufficient statistic. In both continuous and 
discrete graphical models the computationally expensive step is evaluating E p /Q\ [T(x,y)]. 
In discrete problems, this involves a sum over the discrete state space and in continuous 
problem, this requires matrix inversion. For both discrete and continuous models, there 
has been much work on on addressing these difficulties. For discrete models, the junction 
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tree algorithm is an exact method for evaluating marginals and is suitable for models with 
low treewidth. Variational methods such as belief propagation and tree reweighted belief 
propagation work by optimizing a surrogate likelihood function by approximating the par- 
tition function Z(&) by a tractable surrogate Z{Q) (Wainwright and Jordan, 2008). In 



the case of a large discrete state space, these methods can be used to approximate p(y) 
and do approximate maximum likelihood estimation for the discrete model. Approximate 
maximum likelihood estimation can also be done via monte carlo estimates of the gradients 
T(x,y) — E p fQ\(T(x,y)). For continuous Gaussian graphical models, efficient algorithms 
based on block coordinate descent (Friedman et al. , 2008b Banerjee et al. , 2008) have been 



developed, that do not require matrix inversion. Since the pairwise mixed model includes 
both the discrete and continuous models as special cases, maximum likelihood estimation 
is at least as difficult as the two special cases. We defer the specifics of sampling from the 
model and maximum likelihood estimation to Appendices [A| and [B| 



3.1 Pseudolikelihood 



The pseudolikelihood method Besag (1975) is a computationally efficient surrogate likeli- 



hood method formed by products of all the conditional distributions: 



£{Q\x,y) = -J2 l °SP(xs\x\ s ,y;@) - ^ log p{y r \x, y\ r ; 9) 



(7) 



r=l 



The conditional distributions p(x s \x\ s , y; 6) and p(y r = k\y\ r> , x; 6) take on the familiar form 
of linear gaussian and (multiclass) logistic regression. 

• For a discrete variable y r with L r states, its conditional distribution is a multinomial 
distribution, as used in (multiclass) logistic regression. Whenever a discrete variable 
is a predictor, each level contributes an additive effect; continuous variables contribute 
linear effects. 



ex P (£s Psr{yr)x s + 4>rr(yr, Vr) + Ej^r firjiVr, Vj)j 



p{y r \y\ r ,, X] &) 

££l ex P (£ s Psr{l)x s + 4>rr(l, I) + Ys&r <M^> 

This can be expressed in the more familiar form as 



P(y r = k) 



exp (a%z) 



exp f a k + Ej at k j; 



E;=i exp (afz) J^ti ex P («W + Ej ^0 / 
The discrete variables are represented as dummy variables for each state Zj = 1 [y u = k] . 

Continuous variable x s given all other variables is a gaussian distribution with a linear 
regression model for the mean. 

, , ^ VK~ S (-Pss f a s + Ejpsj(yj)-Et^s^tx t 

P{Xs\x\ s , y; 9) = —f= exp 1 1 



2tt 



1% 
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This can be expressed as 

E(x s \zi, ...,z p ) = a T z = a + zjOj (8) 



3 



p(x s \z 1 , . . . ,z p ) = J— exp (-^(x s - a T z) 2 ) with a (9) 



Taking the negative log of both gives us 



- lo gP (x s \x\ s , y; 9) = -\ log 0„ + ^ £ *M - £ |* a* - x s ] (10) 



ex P (Es Psr(y r )x s + 0rr(yr, 2/r) + Ej^r firjiVr, Vj) 

-logp(y r \y\ rj ,x;&) = -log — ^ ^ ^ (11) 

Ei=l ex P (Es/ ? sr(/)^ + 0rr(U) + Ej^r <t>rj(l,yj) 



The discrete conditional — logp(y r \y\ ri , x; G) is a standard (multiclass) logistic regression 
and thus convex in p sr and (p r j. The continuous conditional — logp(x s \x\ s , y; G) is gaussian 
linear regression with unknown variance. This function is convex in f3 s t, p s j(yj), and f3 ss 
because x 2 /y is jointly convex in x and y > 0. Thus the terms (3g t /(3 ss and p 2 j/f3 ss are 

convex. From this we can see that the negative log pseudolikelihood £(G) in ([7]) is jointly 
convex in Q. 

4. Conditional Independence and Penalty Terms 

In this section, we show how to incorporate edge selection into the maximum likelihood or 
pseudolikelihood procedures. In the graphical representation of probability distributions, 
the absence of an edge e = (u, v) corresponds to a conditional independency statement 



that variables x u and x v are conditionally independent given all other variables ( Roller and 



Friedman, 2009). We would like to maximize the likelihood subject to a penalization on 
the number of edges since this results in a sparse graphical model. In the pairwise mixed 
model, there are 3 type of edges 

1. (3 st is a scalar that corresponds to an edge from x s to x t . f3 s t = implies x s and 
xt are conditionally independent given all other variables. This parameter is in two 
conditional distributions, corresponding to either x s or xt is the response variable, 
P(x s \x\ s ,y;@) and p(x t \x\ t , y; 0). 

2. p s j is a vector of length Lj. If p s j(yj) = for all values of yj, then yj and x s 
are conditionally independent given all other variables. This parameter is in two 
conditional distributions, corresponding to either x s or yj is the response variable, 
p(x s \x\ s , y- 6) and p(yj\x, y\f, 6). 

3. 4> r j is a matrix of size L r x Lj. If (firjilJr, Vj) = for all values of y r and yj, then y r 
and yj are conditionally independent given all other variables. This parameter is in 
two conditional distributions, corresponding to either y r or yj is the response variable, 
p{Vr\x, y\ r ; ©) and p(yj\x, y^; @). 
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Figure 1: Symmetric matrix represents the parameters of the model. This example 
has p = 3, q = 2, L\ = 2 and L2 = 3. The red square corresponds to the continuous 
graphical model coefficients B and the solid red square is the scalar (3 st . The blue square 
corresponds to the coefficients p s j and the solid blue square is a vector of parameters p s j(-)- 
The orange square corresponds to the coefficients (p r j and the solid orange square is a matrix 
of parameters 4> r j(-, •)• 



For edges that involve discrete variables, the absence of that edge requires that the entire 
matrix (p r j or vector p s j is 0. This motivates the following regularized optimization problem 



minimize e ^A(@) = ^(®) + A 



£l[/3 st ^0]+^l[p si ^0]+5>[ 

s<t sj r<j 



'T3 



#0] 



All parameters that correspond to the same edge are grouped in the same indicator function. 
This problem is non-convex, so we replace the Iq sparsity and group sparsity penalties with 
the appropriate convex relaxations. For scalars, we use the absolute value (l± norm), for 
vectors we use the I2 norm, and for matrices we use the Frobenius norm. This choice 
corresponds to the standard relaxation from group Iq to group I1/I2 (group lasso) norm 



(Bach et al. 2011 Yuan and Lin. 2006). 



p s— 1 



1 3- 



minimize 4(e) = + A |/3 st | + £ £ || Psj || 2 + 

EE 



J T] || p 



(13) 



t=i t=i 



=1 3=1 



=1 r=l 



5. Calibrated regularizers 

In this section, we consider calibrating the regularization penalties for each set of param- 
eters. We introduce weights for each group of parameters and show how to choose the 
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weights such that each parameter set is treated equally under the fully-factorized indepen- 
dence model 



P s-l p q q j-l 

minimizee4(©) = ^(©) + A { ^^w st |/3 st | + ^ ^ w sj \\p s j\\ 2 + ^2^2 w rj \\4>rj\ 

4=1 4=1 s=l j=l j=l r=l 



Based on the KKT conditions, the parameter group is non-zero if 



(14) 



dfa 



> \w n 



where 9 g and w g represents one of the parameter groups and its corresponding weight. Thus 
for all parameters to be on equal footing, we would like to choose the weights w such that 



E 



PF 



d( 



d0„ 



constant x w„ 



However, it is simpler to compute in closed form E, 



PF 



dt 



, so we choose 



w„ 



'E 



PF 



d£ 



In Appendix [C} we show that the weights can be chosen as 

w st = a s a t 



w 



w 



r.J 



'^PaO- ~ Pa) ^2 ~ 9b) 



a s is the standard deviation of the continuous variable x s . p a = Pr(yj = a). For all 3 
types of parameters, the weight has the form of w uv = tr(cov(z M ))tr(cov(^)), where z 
represents a generic variable and cov(z) is the variance-covariance matrix of z. 



6. Optimization Algorithms 



In this section, we discuss two algorithms for solving (13). This is a convex optimization 



problem that decomposes into the form f(x) + g(x), where / is smooth and convex and g 
is convex but possibly non-smooth. In our case / is the negative log-likelihood or negative 
log-pseudolikelihood and g are the group sparsity penalties. 

Block coordinate descent is a frequently used method when the non-smooth function 
g is the l\ or group l\. It is especially easy to apply when the function / is quadratic, 



2. Under the independence model pf is fully-factorized p(x,y) = Yl^ = iP{ x s) Ylr=iP(V r ) 
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since each block coordinate update can be solved in closed form for many different non- 
smooth g (Friedman et al. 2007). The smooth / in our particular case is not quadratic, so 



each block update cannot be solved in closed form. However in certain problems (sparse 
the update can be approximately solved by using an appropriate inner 



inverse covariance 



optimization routine (Friedman et al. 2008b). 



6.1 Proximal Gradient 



Problems of this form are well-suited for the proximal gradient and accelerated proximal 



gradient algorithms as long as the proximal operator of g can be computed (Combettes and 



Pesquet, 2011 Beck and Teboulle, 2010) 



proxt(x) = argmin u — \\x 



+ g{u) 



(15) 



For the sum of I2 group sparsity penalties considered, the proximal operator takes the 
familiar form of soft-thresholding and group soft-thresholding. Since the groups are non- 
overlapping, the proximal operator simplifies to scalar soft-thresholding for (3 st and group 
soft-thresholding for p s j and <j) r j- 

The class of proximal gradient and accelerated proximal gradient algorithms is directly 
applicable to our problem. These algorithms work by solving a first-order model at the 
current iterate Xk 



axgm\n u f(x k ) + Vf(x k ) T (u - x k ) + — \\u - x k \\ 2 + g{u) (16) 

1 2 
= argmin u — \\u - (x k - tVf(x k ))\\ + g(u) (17) 

= prox t (x k - tVf(x k )) (18) 

The iteration is given by (19) 

x k+ \ = prox t (x k — tVf(x k )) where t is determined by line search 

(20) 



The TFOCS framework (Becker et al. 2011) is a package that allows us to experiment 



with 6 different variants of the accelerated proximal gradient algorithm. The TFOCS au- 
thors found that the Auslender- Teboulle algorithm exhibited less oscillatory behavior, and 
proximal gradient experiments in the next section were done using the Auslender- Teboulle 
implementation in TFOCS. 



6.2 Proximal Newton Algorithms 



This section borrows heavily from Schmidt (2010) and Schmidt et al. (2011). The class of 
proximal newton algorithms is a 2nd order analog of the proximal gradient algorithms. It 
attempts to incorporate 2nd order information about the smooth function / into the model 
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function. At each iteration, it minimizes a quadratic model centered at x k 

argmin u /(x fc ) + Vf{x k ) T (u -x k ) + —(u- x k ) T H{u - x k ) + g(u) (21) 

= argmin^ (u - x k + tH~ x V f (x k jf H (u - x k + tH^X/f^)) + g(u) (22) 

= argmin u — \\u - (x k - tH' 1 Vf(x k )) ||^ + g(u) (23) 
:= Hprox t (x k - tH' 1 Vf(x k )) where H = V 2 f(x k ) (24) 
The Hprox operator is analogous to the proximal operator, but in the ||-||^-norm. It 

Algorithm 1 Proximal Newton 
repeat 

Solve subproblem p k = Hproxt (x k — tH^ V f(x k )) — x k using TFOCS. 
Find t to satisfy Armijo line search condition with parameter a 

f(x k + tp k ) + g(x k + tp k ) < f(x k ) +g(x k ) - y \\p k \\ 2 

Set x k+ i = x k + tpk 

k = k + l 

until ^Y^Y 1 ' 1 < tol 
IN || 

simplifies to the proximal operator if H = I, but in the general case of positive definite H 
there is no closed-form solution for many common non-smooth g(x) (including l\ and group 
However if the proximal operator of g is available, each of these sub- problems can be 
solved efficiently with proximal gradient. In the case of separable g, coordinate descent is 
also applicable. Fast methods for solving the subproblem Hprox t (x k —tH~ 1 V f(x k )) include 



coordinate descent methods, proximal gradient methods, or Barzilai-Borwein ( 


Friedman 


et al. 


2007; 


Combettes and Pesquet 


2011 


Beck and Teboulle 


2010; 


Wright et al. 


2009 


)• 



The proximal Newton framework allows us to bootstrap many previously developed solvers 
to the case of arbitrary loss function /. 



Proximal Newton methods generally require fewer outer iterations (evaluations of Hprox) 
than first-order methods while providing higher accuracy because they incorporate 2nd 
order information. They are also widely applicable since the algorithm only needs access to 
the V/ and proximal operator of g, if each subproblem is solved using proximal gradient or 
Barzilai Borwein. Since the objective is quadratic, coordinate descent is also applicable to 
the subproblems. The hessian matrix H can be replaced by a quasi-newton approximation 
such as BFGS/L-BFGS/SR1. In our implementation, we use a L-BFGS approximation and 
solve the subproblem with TFOCS. 

6.3 Path Algorithm 

Frequently in machine learning and statistics, the regularization parameter A is heavily 
dependent on the dataset. A is generally chosen via cross-validation or holdout set per- 
formance, so it is convenient to provide solutions over an interval of [A m j n , A max ]. We 
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start the algorithm at Ai = \ m ax and solve, using the previous solution as warm start, for 
A2 > • • • > A m j n . We find that this reduces the cost of fitting an entire path of solutions 
(See Fi gure l3|) . Xmax can be chosen as the smallest value such that all parameters are by 
using the subgradient equations. 

7. Conditional Model 

We can generalize our mixed model to include a conditional model by incorporating fea- 
tures, similar to a conditional random field. Conditional models only model the conditional 
distribution p(z\f), as opposed to the joint distribution p(z, /), where z are the variables of 
interest to the prediction task and / are features. 

In addition to observing x and y, we observe features / and we build a graphical model 
for the conditional distribution p(x,y\f). Consider a full pairwise model p(x,y,f) of the 
form 0. We then choose to only model the joint distribution over only the variables x and 
y to give us p(x, y\f) which is of the form 

f p p I p p q 

p{x,y\f;Q) = z(Qlf) exp ^ ^2 ~2^ stXsXt + X) a ' x » + Y,^ Psj(Vj) x s 

\ l-'J \ s =l t=l S=l 8=1 j=l 

q j F p F q \ 

+ YY firjiVr, Vj)+YY llsXsfl + YY ^r(yr)fl J (25) 
j=l r =l 1=1 a=l 1=1 r=l J 

We can also consider a more general model where each pairwise edge potential depends on 
the features 

(p p * p 
Y Y ~2 l3st ^ XsXt + Y a '(f) x » 
s=l t=l s=l 

p q q j \ 

+ Y Y P'i(Vi> f)x 8 + Y Y fajiVr, Vj, f) I (26) 

s=l j=l 3=1 r=l I 



1 

p(x,y\f)@) = ^Q|yy ex P 



(25) is a special case of this where only the node potentials depend on features and the 
pairwise potentials are independent of feature values. The specific parametrized form we 
consider is 4> rj (y r ,yj, f) = <f> r j(y r ,yj) for r / j, p s j(yj,f) = Psj(yj), and /3 st (f) = (3 st . 
The node potentials depend linearly on the feature values, a s (f ) = a s + J2i=i lis x sfu and 

8. Experimental Results 

We present experimental results on synthetic data, survey data and on a conditional model. 
8.1 Synthetic Experiments 

In the synthetic experiment, the training points are sampled from a true model with 10 



continuous variables and 10 binary variables. The edge structure is shown in Figure 2a. A 
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(a) True Graph Structure 




1000 1500 
sample size 



(b) Probability of recovery 



Figure 2: Figure 2a shows the graph used in the synthetic experiments for p = q = 4. Blue 
nodes are continuous variables, red nodes are binary variables and the orange, green and 
dark blue lines represent the 3 types of edges. Figure 2b is a plot of the probability of 
correct edge recovery at a given sample size. Results are averaged over 100 trials. 



is chosen as 5y log P +q as suggested in several theoretical studies (Ravikumar et al. 



2010 



Jalali et al. 2011). We see from the experimental results that recovery of the correct edge 



set undergoes a sharp phase transition, as expected. With n = 1000 samples, we are recov- 



ering the correct edge set with probability nearly 1. This is expected since Meinshausen and 



Buhlmann (2006) showed that for sparse gaussian graphical models node-wise linear regres- 



sion recovers the true edge set. Similarly for discrete graphical models, Ravikumar et al. 



(2010) and Jalali et al. (2011) showed respectively that logistic regression, and (multiclass) 



logistic regression recover the true edge set under certain incoherence assumptions. The 
phase transition experiments were done using the proximal Newton algorithm discussed in 
Section 16.21 



8.2 Survey Experiments 

The survey dataset we consider consists of 11 variables, of which 2 are continuous and 9 
are discrete: age (continuous), log- wage (continuous), year (7 states), sex(2 states), marital 
status (5 states), race(4 states), education level (5 states), geographic region(9 states), job 
class (2 states), health (2 states), and health insurance (2 states). All the evaluations are 
done using a holdout test set of size 100, 000 for the survey experiments. The regularization 
parameter A is varied over the interval [5 x 10 -5 , .7] at 50 points equispaced on log-scale for 
all experiments. 



8.2.1 Model Selection 

In Figure [3j we study the model selection performance of learning a graphical model over 
the 11 variables under different training samples sizes. We see that as the sample size 
increases, the optimal model is increasingly dense, and less regularization is needed. 
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14.5 




1(T 5 1(T 4 1CT 3 1Cf 2 1CT 1 10° 

Regularization Parameter Log-Scale 



Figure 3: Model selection under different training set sizes. Circle denotes the lowest test 
set negative log pseudolikelihood and the number in parentheses is the number of edges in 
that model at the lowest test negative log pseudolikelihood. The saturated model has 55 
edges. 



8.2.2 Comparing against Separate Regressions 

A sensible baseline method to compare against is a separate regression algorithm. This 
algorithm fits a linear gaussian or (multiclass) logistic regression of each variable condi- 
tioned on the rest. We can evaluate the performance of the pseudolikelihood by evaluating 
— \ogp(x s \x\ s , y) for linear regression and — log p(y r \y\ r , x) for (multiclass) logistic regres- 
sion. Since regression is directly optimizing this loss function, it is expected to do better. 
The pseudolikelihood objective is similar, but has half the number of parameters as the 
separate regressions since the coefficients are shared between two of the conditional likeli- 
hoods. From Figures [4] and [5j we can see that the pseudolikelihood performs very similarly 
to the separate regressions and sometimes even outperforms regression. The benefit of the 
pseudolikelihood is that we have learned parameters of the joint distribution p(x, y) and not 
just of the conditionals p(x s \y, x\ s ). On the test dataset, we can compute quantities such 
as conditionals over arbitrary sets of variables p(yA, XB\yA c i x b c ) an d marginals p(xA,yB) 



(Roller and Friedman, 2009). This would not be possible using the separate regressions. 



8.2.3 Conditional Model 



Using the conditional model (25), we model only the 3 variables logwage, education(5) and 
jobclass(2). The other 8 variables are only used as features. The conditional model is then 
trained using the pseudolikelihood. We compare against the generative model that learns 
a joint distribution on all 11 variables. From Figure [6j we see that the conditional model 
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Full Model 



100 



Separate 
age 



Joint 



logwage 





Figure 4: Separate Regression vs Pseudo-likelihood n = 100. y-axis is the appropriate 
regression loss for the response variable. For low levels of regularization and at small training 
sizes, the pseudolikelihood seems to overfit less; this may be due to a global regularization 
effect from fitting the joint distribution as opposed to separate regressions. 
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Figure 5: Separate Regression vs Pseudo-likelihood n = 10, 000. y-axis is the appropriate 
regression loss for the response variable. At large sample sizes, separate regressions and 
pseudolikelihood perform very similarly. This is expected since this is nearing the asymp- 
totic regime. 
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Figure 6: Conditional Model vs Generative Model at various sample sizes, y-axis is test set 
performance is evaluated on negative log pseudolikelihood of the conditional model. The 
conditional model outperforms the full generative model at except the smallest sample size 
n = 100. 



outperforms the generative model, except at small sample sizes. This is expected since 
the conditional distribution models less variables. At very small sample sizes and small A, 
the generative model outperforms the conditional model. This is likely because generative 
models converge faster (with less samples) than discriminative models to its optimum. 

8.2.4 Maximum Likelihood vs Pseudolikelihood 

The maximum likelihood estimates are computable for very small models such as the con- 
ditional model previously studied. The pseudolikelihood was originally motivated as an 
approximation to the likelihood that is computationally tractable. We compare the max- 
imum likelihood and maximum pseudolikelihood on two different evaluation criteria: the 
negative log likelihood and negative log pseudolikelihood. In Figure [7J we find that the pseu- 
dolikelihood outperforms maximum likelihood under both the negative log likelihood and 
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Figure 7: Maximum Likelihood vs Pseudolikelihood. y-axis for top row is the negative log 
pseudolikelihood. y-axis for bottom row is the negative log likelihood. Pseudolikelihood 
outperforms maximum likelihood across all the experiments. 



negative log pseudolikelihood. We would expect that the pseudolikelihood trained model 
does better on the pseudolikelihood evaluation and maximum likelihood trained model does 
better on the likelihood evaluation. However, we found that the pseudolikelihood trained 
model outperformed the maximum likelihood trained model on both evaluation criteria. 
Although asymptotic theory suggests that maximum likelihood is more efficient than the 
pseudolikelihood, this analysis may not be applicable because of the finite sample regime 
and misspecified model (Liang and Jordan 2008). 
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Appendix A. Sampling Prom The Joint Distribution 

In this section we discuss how to draw samples (x,y) ~ p(x,y). Using the property that 
p(x,y) =p{y)p{x\y), we see that if y ~ p(y) and x ~ p(x\y) then (x,y) ~ p(x,y). We have 
that 

p(y) oc exp(YiMyr,yj) + \p{y) T B- 1 P {y)) (27) 

r,j 

(p(y)) s = Y^Psj(y 3 ) (28) 

j 

p(x\y)=No(B- 1 (a + p(y)),B- 1 ) (29) 

The difficult part is to sample y ~ p(y) since this involves the partition function of the 
discrete MRF. This can be done with MCMC for larger models and junction tree algorithm 
or exact sampling for small models. 

Appendix B. Maximum Likelihood 

The joint distribution and loglikelihood are: 



p(x,y;Q) = exp(--x T Bz + (a + p{y)) T x + ^ <p r j(y r , yj))/Z(Q) 

(r,j) 

£(Q) = - x T Bx - (a + p(y)fx - ^ <t>rj{Vr, Vj) 

\ (rj) / 

+ log(^ j dxexp(-^x T Bx + (a + p(y')) T x)exp(J2 c l ) rj(.y' r ,yj))) 



The derivative is 



dB 2 



l xxT I dx (J2 y ' ~ \xx T exp(- \x T Bx + (a + p{y)) T x + £ (rJ) ^{y^y'j))) 



Z(@) 



= -xx T + J ^(--xx T p(x,y';G)) 



-xx 



+ I -^ xxT p( x \y'^ Q )p(y') 



\ xx t + (B- 1 + B~\a + P {y')){a + piy'f)^ 1 ) P {y') 
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The primary cost is to compute B 1 and the sum over the discrete states y. 

The computation for the derivatives of £(Q) with respect to p s j and 4> r j are similar. 

^ ^ ^ = -HVr = a,yj = b) + J dxl (y'r = a > v'j = b M x , v'\ e ) 
= -l(y r = a, yj = b) + ^ l(y' r = a, y] = b)p{y') 



The gradient requires summing over all discrete states. 
Similarly for p s j(a): 

— T\ = = a ) Xs + E / dx ( l (y'j = a)x s )p(x',y';G) 

Psj(a) J 

= = a)x s + J dx^Xspixly'^^y'j = a)p{y'^,y' j = a) 

MLE estimation requires summing over the discrete states to compute the expected sufficient 
statistics. This may be approximated using using samples (x,y) ~ p(x,y;Q). The method 
in the previous section shows that sampling is efficient if y ~ p(y) is efficient. This allows 
us to use fast MCMC methods developed for discrete MRF's such as the Swendsen-wang 
method. 



Appendix C. Choosing the Weights 

We first show how to compute w s j. The gradient of the pseudo-likelihood with respect to 
a parameter p s j(a) is given below 

d£ 



dp~(a) = S " 2 X 1 & = a ] x » + EpMyj = a l x Mi> x ^ + E pfWV3 = a]x s \x\ s , y { ) 

n 

= £-2xl [y] = a] x\ + x\p{y 3 = a) + 1 [y] = a] p s 

i=l 
n 

= E 1 & = a ] (A- " x s) + < {p(Vj = a) - 1 [y] = a] ) 



i=i 



£ i 1 [yj = a \~ pfa = a )) (a- - x D + {< - a*) 0% = a)-i[ y ) = a\) 
i=i 

(30) 

n 

E 2 i 1 M = °] - p(yj = «)) (a- - *») ( 31 ) 
i=i 
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Since the subgradient condition includes a variable if 
By independence, 



dl 



> A, we compute E 



dl 



E 



PF 



E 2 i 1 [yj = a \- ftvj = °)) - x s) 



i=l 



= 4nE PF (||l [y] = a] - p( Vj = a)f) E PF (\\jl s - xlf) 
= 4(n - l)p( yj = o)(l - p(j/j = a))a 2 s 



(32) 

(33) 
(34) 



The last line is an equality if we replace the sample means p and fi with the true values p and 

2 



\i. Thus for the entire vector p s j we have E PF 



dl 



dp s 



4(n-l) (E a p(yj = o)(l - = a)) cr 



If we let the vector z be the indicator vector of the categorical variable yj , and let the vector 

= 4(n - 1) EaPal 1 - Pa)<r 2 = 4 ( n - l)tr(cov(z))var(x) 



dps 



V = PiVj = a), then E PF 
and w sj = \JY, a Pa,( l -Pa)o1- 

We repeat the computation for f} st . 



dl 

W7t 



-2x\x t + E PF (x l s xl\x\ s ,y) + E PF {x l s x\\x\ u y) 

i=l 

n 

- 2x >i + m + fox*, 

i=l 

n 

Y4i^-xi)+xi{fit-xi) 
i=i 

n 

Y(4 - fo){$8 - X\) + (xl - lls){fit - X\) 
i=l 
n 

^{4- M^s- xi) 



i=l 



Thus 



E 



s X s 



i=l 



= AnE PF \\x t - fit\\ 2 E PF \\x s - rriu s \\ 2 

2_2 



= A{n-\)aiai 
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Thus E, 



VF 



dt 



4(n — l)cj 2 s at and taking square-roots gives us w s t = o s ot- 



We repeat the same computation for <j) r j. Let p a = Pr(y r = a) and % = Pr(yj = b). 

dl_ 

d(f> rj (a,b) 



= ^2-l[yi. = a]l[y) = b]+E(l [y r = a]l[ yj = b]\y v ,x) 
i=i 

+ E(t[y r = a]t[y j = b}\y\ j ,x) 

n 

= J2-l[yi = a]t[y} = b] +p a t[y) = b] + q b t[y l r = a] 
i=i 

n 

= E 1 ^ = b ](P" -l[yi = a]) + t[yi = a](q b - 1 [y] = b]) 
i=i 

= E^ 1 M = b ] ~ ®)(P« - 1 [fr = «] ) + (1 [i/j = «] " P«)(& " 1 [Vj = 
i=l 
n 

= £ 2(1 [y} = 6] = a]) 



i=i 

Thus we compute 







"-«( 


d4>rj{a, b) 



E 2 (i[j/} = 6] -96)(p«-i[y; = o]) 



1=1 



4n£ PF ||% - = 6] || 2 £ PF ||p a - lh> = a] 
4(n - 1)%(1 - ?6)p (l - Pa) 



From this, we see that i?, 



PF 



Ea=l Efei 4 (« " - %)Pa{l ~ Pa) and 



Ea=l Ebil - ?ft)Po(l - Pa) 
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